{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "a5fA5qAm5Afg"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "You can run this notebook either locally or on Google Colab.\n",
    "\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
    "3. Optional: Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **_NOTE:_** Find the official NeMo documentation at \n",
    "https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/intro.html "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview\n",
    "<img src=\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/tutorials/text_processing/images/task_overview.png\" width=\"600\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F-IrnmXMTevr"
   },
   "source": [
    "A sentence can be split into semiotic tokens that belong to a variety of classes in which the spoken form differs from the written form. Examples are *dates*, *decimals*, *cardinals*, and *measures*. A good text normalization (TN) or inverse text normalization (ITN) system is able to handle a variety of **semiotic classes**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-IT1Xr9iW2Xr"
   },
   "source": [
    "# How to use\n",
    "## 1. Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install NeMo, which installs both the nemo and nemo_text_processing packages\n",
    "BRANCH = 'main'\n",
    "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import nemo_text_processing and other dependencies\n",
    "import nemo_text_processing\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Text Normalization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Bfs7fa9lXDDh"
   },
   "outputs": [],
   "source": [
    "# create text normalization instance that works on cased input\n",
    "from nemo_text_processing.text_normalization.normalize import Normalizer\n",
    "normalizer = Normalizer(input_case='cased', lang='en')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the Normalizer class accepts the following parameters\n",
    "print(normalizer.__doc__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **_NOTE:_** Standard text normalization uses `deterministic=True`, which produces a single output for a given input string.\n",
    "\n",
    "\n",
    "\n",
    "### 2.1 Run TN on input string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalizer.normalize() offers the following parameterization\n",
    "print(normalizer.normalize.__doc__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run normalization on example string input\n",
    "written = \"We paid $123 for this desk.\"\n",
    "normalized = normalizer.normalize(written, verbose=True, punct_post_process=True)\n",
    "print(normalized)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intermediate semiotic class information is shown if `verbose=True`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### 2.2 Run TN on list of input strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UD-OuFmEOX3T"
   },
   "outputs": [],
   "source": [
    "# create temporary data folder and example input file\n",
    "DATA_DIR = 'tmp_data_dir'\n",
    "os.makedirs(DATA_DIR, exist_ok=True)\n",
    "INPUT_FILE = f'{DATA_DIR}/inference.txt'\n",
    "! echo -e 'The alarm went off at 10:00a.m. \\nI received $123' > $INPUT_FILE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "d4T0gXHwY3JZ"
   },
   "outputs": [],
   "source": [
    "# check input file was properly created\n",
    "! cat $INPUT_FILE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load input file into 'data' - a list of strings\n",
    "data = []\n",
    "with open(INPUT_FILE, 'r') as fp:\n",
    "    for line in fp:\n",
    "        data.append(line.strip())\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "F5wSJTI8ZFRg"
   },
   "outputs": [],
   "source": [
    "# run normalization on 'data'\n",
    "normalizer.normalize_list(data, punct_post_process=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RMT5lkPYzZHK"
   },
   "source": [
    "### 2.3 Evaluate TN on written-normalized text pairs\n",
    "\n",
    "The evaluation data needs to have the following format.\n",
    "\n",
    "For example, the sentence 'on 22 july 2012 they worked until 12:00' and its normalization are represented as:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# example evaluation sentence\n",
    "eval_text =  \"\"\"PLAIN\\ton\\t<self>\n",
    "DATE\\t22 july 2012\\tthe twenty second of july twenty twelve\n",
    "PLAIN\\tthey\\t<self>\n",
    "PLAIN\\tworked\\t<self>\n",
    "PLAIN\\tuntil\\t<self>\n",
    "TIME\\t12:00\\ttwelve o'clock\n",
    "<eos>\\t<eos>\n",
    "\"\"\"\n",
    "EVAL_FILE = f'{DATA_DIR}/eval.txt'\n",
    "with open(EVAL_FILE, 'w') as fp:\n",
    "    fp.write(eval_text)\n",
    "! cat $EVAL_FILE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RMT5lkPYzZHK"
   },
   "source": [
    "That is, every sentence is broken into semiotic tokens, one per line, and terminated by the end-of-sentence token `<eos>`. A plain token is written as `[SEMIOTIC CLASS] [TAB] [WRITTEN] [TAB] <self>`; any other token is written as `[SEMIOTIC CLASS] [TAB] [WRITTEN] [TAB] [NORMALIZED]`.\n",
    "This format was introduced in the [Google text normalization dataset](https://arxiv.org/abs/1611.00068)."
   ]
  },
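  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the format described above concrete, the next cell is a minimal sketch that turns such text into written/normalized sentence pairs using plain Python only. The helper name `parse_semiotic_lines` and the `sample` string are hypothetical and exist only for illustration; this is a simplified stand-in, not the parser that `nemo_text_processing` uses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: parse_semiotic_lines is a hypothetical helper, not part of NeMo\n",
    "def parse_semiotic_lines(text):\n",
    "    # returns a list of (written sentence, normalized sentence) pairs\n",
    "    pairs, written, normalized = [], [], []\n",
    "    for line in text.splitlines():\n",
    "        fields = line.split('\\t')\n",
    "        if fields[0] == '<eos>':\n",
    "            # end of sentence: join the collected tokens and start a new sentence\n",
    "            pairs.append((' '.join(written), ' '.join(normalized)))\n",
    "            written, normalized = [], []\n",
    "        elif len(fields) == 3:\n",
    "            _semiotic_class, token, spoken = fields\n",
    "            written.append(token)\n",
    "            # '<self>' marks a plain token whose spoken form equals its written form\n",
    "            normalized.append(token if spoken == '<self>' else spoken)\n",
    "    return pairs\n",
    "\n",
    "# hypothetical two-token sample in the format described above\n",
    "sample = 'PLAIN\\ton\\t<self>\\nDATE\\t22 july 2012\\tthe twenty second of july twenty twelve\\n<eos>\\t<eos>\\n'\n",
    "print(parse_semiotic_lines(sample))"
   ]
  },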
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse evaluation file into written and normalized sentence pairs\n",
    "from nemo_text_processing.text_normalization.data_loader_utils import load_files, training_data_to_sentences\n",
    "eval_data = load_files([EVAL_FILE])\n",
    "sentences_un_normalized, sentences_normalized, sentences_class_types = training_data_to_sentences(eval_data)\n",
    "print(list(zip(sentences_un_normalized, sentences_normalized)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run prediction\n",
    "sentences_prediction = normalizer.normalize_list(sentences_un_normalized)\n",
    "print(sentences_prediction)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# measure sentence accuracy\n",
    "from nemo_text_processing.text_normalization.data_loader_utils import evaluate\n",
    "sentences_accuracy = evaluate(\n",
    "            preds=sentences_prediction, labels=sentences_normalized, input=sentences_un_normalized\n",
    "        )\n",
    "print(\"- Accuracy: \" + str(sentences_accuracy))"
   ]
  },
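  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the metric, the next cell sketches sentence accuracy under the assumption that it is the fraction of predictions that exactly match their reference. The function name `exact_match_accuracy` and the toy inputs are hypothetical; the library's `evaluate` may apply additional cleanup before comparing, so its result can differ from this sketch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: exact_match_accuracy is a hypothetical helper, not part of NeMo\n",
    "def exact_match_accuracy(preds, labels):\n",
    "    # assumption: a sentence counts as correct only if prediction and label are identical strings\n",
    "    if len(preds) != len(labels):\n",
    "        raise ValueError('preds and labels must have the same length')\n",
    "    if not labels:\n",
    "        return 0.0\n",
    "    correct = sum(pred == label for pred, label in zip(preds, labels))\n",
    "    return correct / len(labels)\n",
    "\n",
    "# toy inputs: one of the two predictions matches its label\n",
    "toy_preds = ['twelve o\\'clock', 'one hundred']\n",
    "toy_labels = ['twelve o\\'clock', 'one hundred one']\n",
    "print(exact_match_accuracy(toy_preds, toy_labels))"
   ]
  },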
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Inverse Text Normalization\n",
    "ITN supports an API equivalent to that of TN. Here we only show inverse normalization on an input string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create inverse text normalization instance\n",
    "from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer\n",
    "inverse_normalizer = InverseNormalizer(lang='en')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run ITN on example string input\n",
    "spoken = \"we paid one hundred twenty three dollars for this desk\"\n",
    "un_normalized = inverse_normalizer.inverse_normalize(spoken, verbose=True)\n",
    "print(un_normalized)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Audio-based Text Normalization\n",
    "Audio-based text normalization uses extended [WFST](https://en.wikipedia.org/wiki/Finite-state_machine) grammars to provide a range of possible normalization options.\n",
    "The following example shows the workflow. (Disclaimer: the exact values in the graphic do not necessarily reflect the real system's behavior.)\n",
    "1. The text \"627\" is sent to the extended TN WFST grammar.\n",
    "2. The grammar outputs 5 different verbalization options based on the text input alone.\n",
    "3. If an audio file is provided, the audio transcript is compared with the verbalization options to determine which normalization is correct, based on character error rate. The transcript is generated using a pretrained NeMo ASR model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/tutorials/text_processing/images/audio_based_tn.png\" width=\"600\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following shows an example of how to generate multiple normalization options:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import non-deterministic WFST-based TN module\n",
    "from nemo_text_processing.text_normalization.normalize_with_audio import NormalizerWithAudio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the normalizer. Generating the extended grammars may take some time,\n",
    "# so we recommend caching them by specifying a cache directory.\n",
    "normalizer = NormalizerWithAudio(\n",
    "        lang=\"en\",\n",
    "        input_case=\"cased\",\n",
    "        overwrite_cache=False,\n",
    "        cache_dir=\"cache_dir\",\n",
    "    )\n",
    "# create up to 10 normalization options\n",
    "print(normalizer.normalize(\"123\", n_tagged=10, punct_post_process=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Parallel execution\n",
    "\n",
    "`Normalizer.normalize()` and `InverseNormalizer.inverse_normalize()` have no side effects.\n",
    "Thus, if you need to normalize a large number of input examples, the calls can be executed in parallel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ENMDNl9C4TkF"
   },
   "source": [
    "# Tutorial on how to customize grammars\n",
    "\n",
    "https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/WFST_Tutorial.ipynb\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lcvT3P2lQ_GS"
   },
   "source": [
    "# References and Further Reading:\n",
    "\n",
    "\n",
    "- [Zhang, Yang, Bakhturina, Evelina, Gorman, Kyle and Ginsburg, Boris. \"NeMo Inverse Text Normalization: From Development To Production.\" (2021)](https://arxiv.org/abs/2104.05055)\n",
    "- [Ebden, Peter, and Richard Sproat. \"The Kestrel TTS text normalization system.\" Natural Language Engineering 21.3 (2015): 333.](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)\n",
    "- [Gorman, Kyle. \"Pynini: A Python library for weighted finite-state grammar compilation.\" Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata. 2016.](https://www.aclweb.org/anthology/W16-2409.pdf)\n",
    "- [Mohri, Mehryar, Fernando Pereira, and Michael Riley. \"Weighted finite-state transducers in speech recognition.\" Computer Speech & Language 16.1 (2002): 69-88.](https://cs.nyu.edu/~mohri/postscript/csl01.pdf)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [
    "lcvT3P2lQ_GS"
   ],
   "name": "Text_Normalization_Tutorial.ipynb",
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
